July 10, 2019

DNA Methylation (5mC)

  • chemical modification to an individual DNA base (Cytosine)
  • doesn’t change coding sequence

image source: https://www.epigentek.com

DNA Methylation (5mC) at CpGs

The 5th base?

Preview: role in gene regulation

Evolution of methylation assays

Analysis of methylation arrays

  • Methylation microarray (e.g. Infinium 450K, EPIC 850K) analysis has similarities to expression microarray analysis
    • Signal is continuous
    • 1 sample per array
  • Beta values: \[\beta = \frac{\text{methylated intensity}}{\text{methylated intensity} + \text{unmethylated intensity}}\]
  • Analysis: regression on \(\text{logit}(\beta)\)
  • Packages: minfi, bumphunter
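
The beta value and its logit transform can be sketched in a few lines. This is a minimal illustration of the formula above, not what minfi does internally; the intensities are made-up numbers, and the natural log is an assumption (the slide does not state a base).

```python
import math

def beta_value(meth, unmeth):
    """Proportion of total signal that comes from the methylated channel."""
    return meth / (meth + unmeth)

def logit(beta):
    """Log-odds of a proportion; maps (0, 1) onto the whole real line."""
    return math.log(beta / (1.0 - beta))

# Hypothetical intensities for a single probe (illustrative values only)
meth_intensity = 3000.0
unmeth_intensity = 1000.0

beta = beta_value(meth_intensity, unmeth_intensity)
m_value = logit(beta)
print(f"beta = {beta:.3f}, logit(beta) = {m_value:.3f}")
```

In practice a small offset is sometimes added to the denominator to stabilize probes with low total intensity; it is left out here to match the formula as written.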

image: https://en.wikipedia.org/wiki/Illumina_Methylation_Assay

Analysis of MeDIPseq

Analysis of bisulfite sequencing

  • Bisulfite sequencing (e.g. WGBS, RRBS)
    • basepair resolution
    • binomial count of methylated/unmethylated reads
  • Analysis: beta-binomial regression
  • Packages: bsseq, DSS, dmrseq

Bisulfite sequencing

Preprocessing

Differential methylation

  • differentially methylated cytosine (DMC) or locus (DML):
    • array: test each probe
    • bisulfite sequencing: test each cytosine
  • differentially methylated region (DMR):
    • array: test groups of neighboring probes
    • bisulfite sequencing: test groups of cytosines

DMC test of bisulfite read counts

  • For cytosine \(i\) and sample \(j\) in condition \(s\), let:
    • \(M_{ij}\) be the number of reads showing methylation
    • \(C_{ij}\) be the total number of reads covering \(i\)
    • \(p_s\) be the methylation probability in condition \(s\)
  • Model: \(M_{ij} \mid C_{ij} \sim \text{Binom}(C_{ij}, p_{s})\)
  • How to test whether \(p_1 = p_2\)?
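
One simple way to test \(p_1 = p_2\) at a single cytosine is a likelihood ratio test on counts pooled within each condition. The sketch below is an illustration under that pooling assumption, with made-up read counts; it ignores sample-to-sample variability, so it is a starting point rather than the test any particular package uses.

```python
import math

def binom_loglik(m, c, p):
    """Binomial log-likelihood, dropping the binomial coefficient (it cancels in the ratio)."""
    ll = 0.0
    if m > 0:
        ll += m * math.log(p)
    if c - m > 0:
        ll += (c - m) * math.log(1.0 - p)
    return ll

def lrt_two_conditions(m1, c1, m2, c2):
    """Likelihood ratio test of H0: p1 == p2 from pooled methylated/total read counts."""
    p1, p2 = m1 / c1, m2 / c2          # separate estimates under the alternative
    p0 = (m1 + m2) / (c1 + c2)         # shared estimate under the null
    ll_alt = binom_loglik(m1, c1, p1) + binom_loglik(m2, c2, p2)
    ll_null = binom_loglik(m1, c1, p0) + binom_loglik(m2, c2, p0)
    stat = 2.0 * (ll_alt - ll_null)
    # Upper tail of a chi-square with 1 degree of freedom
    pval = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, pval

# Hypothetical pooled counts: 30 of 40 reads methylated vs 10 of 40
stat, pval = lrt_two_conditions(30, 40, 10, 40)
print(f"LRT statistic = {stat:.2f}, p-value = {pval:.2g}")
```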

Binomial (Logistic) regression

  • Generalized linear model for probability of success \(p\): \[\log\left(\frac{p}{1-p}\right) = \boldsymbol{X\beta}\]
  • Link function \(g(p) = \log\left(\frac{p}{1-p}\right)\) describes the relationship between the response proportion and the linear predictor
  • No closed-form solution for \(\boldsymbol{\beta}\); fit by iterative maximum likelihood (ML) estimation
  • Interpret coefficients on original scale with inverse link function \(g^{-1}\): \[ p = \frac{e^{\boldsymbol{X\beta}}}{1+e^{\boldsymbol{X\beta}}} \]
  • In R: glm(cbind(successes, failures) ~ x, family="binomial")
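
To make the "iterative ML" step concrete, here is a minimal Newton-Raphson (IRLS) fit for the two-parameter model \(\text{logit}(p) = b_0 + b_1 x\) on grouped binomial counts. It is a sketch of the general technique with hypothetical data, not the exact routine inside R's glm; the function names are made up for the example.

```python
import math

def expit(eta):
    """Inverse logit link: maps the linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic_irls(x, successes, totals, n_iter=25, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on grouped binomial data."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0                 # score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0         # Fisher information (2 x 2, symmetric)
        for xi, mi, ci in zip(x, successes, totals):
            p = expit(b0 + b1 * xi)
            resid = mi - ci * p
            w = ci * p * (1.0 - p)
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information) * delta = score
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data: x = 0 for condition 1, x = 1 for condition 2
x = [0, 0, 1, 1]
successes = [12, 18, 5, 5]      # methylated reads per sample
totals = [20, 20, 20, 20]       # total reads per sample

b0, b1 = fit_logistic_irls(x, successes, totals)
p_cond1 = expit(b0)             # inverse link gives probabilities back
p_cond2 = expit(b0 + b1)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, p1 = {p_cond1:.3f}, p2 = {p_cond2:.3f}")
```

With a single binary covariate the fitted probabilities match the pooled proportions in each condition (30/40 and 10/40 here), which is a handy sanity check on the iteration.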

Binomial regression with overdispersion

  • In ordinary logistic regression, \(p\) is assumed to be constant for all samples within a group
  • To model overdispersion, we might want to allow this quantity to vary
  • For example, let \[p_s \sim Beta(\alpha_s, \beta_s)\]
  • Overdispersion parameter \(\phi = \frac{1}{\alpha_s + \beta_s + 1}\) contributes to increased variance
  • In R: aod::betabin(cbind(successes,failures) ~ x, ~ x)
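
The effect of \(\phi\) on the variance can be checked numerically. For a beta-binomial count with coverage \(C\), the variance is the binomial variance inflated by \(1 + (C - 1)\phi\). The sketch below uses hypothetical values of \(\alpha_s\), \(\beta_s\) and \(C\) chosen only for illustration.

```python
def beta_binomial_moments(coverage, alpha, beta):
    """Mean probability, overdispersion, and count variances for a beta-binomial."""
    p = alpha / (alpha + beta)                 # mean of the Beta prior
    phi = 1.0 / (alpha + beta + 1.0)           # overdispersion parameter
    binom_var = coverage * p * (1.0 - p)       # variance if p were fixed
    bb_var = binom_var * (1.0 + (coverage - 1.0) * phi)
    return p, phi, binom_var, bb_var

# Hypothetical values: Beta(3, 1) prior, 30 reads covering the cytosine
p, phi, binom_var, bb_var = beta_binomial_moments(30, 3.0, 1.0)
print(f"p = {p}, phi = {phi}")
print(f"binomial variance = {binom_var}, beta-binomial variance = {bb_var:.2f}")
```

As \(\alpha_s + \beta_s\) grows, \(\phi\) shrinks toward 0 and the two variances coincide, which recovers ordinary binomial regression.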

Pitfalls of binomial regression

  • What happens to link function \(\log\left(\frac{p}{1-p}\right)\) if \(p=0\) or 1?
    • binomial regression unstable for fully methylated or unmethylated cytosines
  • Computationally intensive to fit model at every cytosine
  • DSS: Park & Wu 2016 (https://doi.org/10.1093/bioinformatics/btw026)
    • Differential methylation under general experimental design
    • Alternate link function: \(\arcsin(2p-1)\)
    • Approximate fitting with Generalized Least Squares (GLS)
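
The boundary problem is easy to see by evaluating both links at \(p = 0\) and \(p = 1\). This is a small illustration of the two formulas named above; the function names are made up for the example.

```python
import math

def logit_link(p):
    """log(p / (1 - p)); diverges at the boundaries."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log(p / (1.0 - p))

def arcsine_link(p):
    """arcsin(2p - 1); bounded between -pi/2 and pi/2."""
    return math.asin(2.0 * p - 1.0)

for p in (0.0, 0.01, 0.5, 0.99, 1.0):
    print(f"p = {p:4.2f}  logit = {logit_link(p):8.3f}  arcsine = {arcsine_link(p):7.3f}")
```

The arcsine link stays finite for fully methylated and fully unmethylated cytosines, so estimates on the link scale remain well defined there.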

Generalized Least Squares (GLS) in a nutshell

  • Hybrid of linear regression and generalized linear regression
  • Pro: stable & closed form estimates (fast)
  • Con: approximate
  • Key idea: flexible covariance structure allows for specification of approximate beta-binomial error
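
A minimal sketch of the closed-form idea, assuming a diagonal covariance (which reduces GLS to weighted least squares) and a two-parameter design. The per-sample variance \((1 + (C - 1)\phi)/C\) on the arcsine scale is a delta-method approximation adopted for this illustration, and the counts and \(\phi\) are hypothetical; this is not the DSS implementation.

```python
import math

def gls_diag(x, y, var):
    """Closed-form GLS for y = b0 + b1 * x with covariance diag(var)."""
    s_w = s_x = s_xx = s_y = s_xy = 0.0
    for xi, yi, vi in zip(x, y, var):
        w = 1.0 / vi                   # inverse-variance weight
        s_w += w
        s_x += w * xi
        s_xx += w * xi * xi
        s_y += w * yi
        s_xy += w * xi * yi
    det = s_w * s_xx - s_x * s_x
    b0 = (s_xx * s_y - s_x * s_xy) / det
    b1 = (s_w * s_xy - s_x * s_y) / det
    return b0, b1

# Hypothetical counts, including fully unmethylated and fully methylated samples
x = [0, 0, 1, 1]
meth = [0, 2, 20, 18]
cov = [20, 20, 20, 20]
phi = 0.1                              # hypothetical overdispersion

# Arcsine-transformed proportions and their approximate variances
y = [math.asin(2.0 * m / c - 1.0) for m, c in zip(meth, cov)]
var = [(1.0 + (c - 1.0) * phi) / c for c in cov]

b0, b1 = gls_diag(x, y, var)
p_cond1 = (math.sin(b0) + 1.0) / 2.0           # inverse arcsine link
p_cond2 = (math.sin(b0 + b1) + 1.0) / 2.0
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, p1 = {p_cond1:.3f}, p2 = {p_cond2:.3f}")
```

The estimate comes from a single pass over the data with no iteration, and the boundary samples (0/20 and 20/20) cause no numerical trouble.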

Individual cytosine differences

CpG 1

CpG 2

Previous approaches: grouping significant CpGs

FDR at the region level

\[ \text{False Discovery Rate (FDR)} = E\Big[\frac{FP}{FP + TP}\Big]\]

  • \(FDR_{CpG} = 2/10 = 0.20\)
  • \(FDR_{DMR} = 1/2 = 0.50\)
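
The gap between the two rates can be reproduced with a toy tally. The layout below is hypothetical and chosen only to match the counts above: 10 significant CpGs grouped into 2 regions, where one region holds 8 true positives and the other holds 2 false positives. Strictly, these are realized false discovery proportions; FDR is their expectation.

```python
def false_discovery_proportion(is_false_positive):
    """Fraction of discoveries that are false positives."""
    return sum(is_false_positive) / len(is_false_positive)

# Hypothetical calls: True marks a false-positive CpG
cpg_calls = {
    "region_A": [False] * 8,   # 8 truly differential CpGs
    "region_B": [True] * 2,    # 2 null CpGs that passed the threshold
}

# CpG level: every significant CpG is one discovery
cpg_flags = [flag for flags in cpg_calls.values() for flag in flags]
fdp_cpg = false_discovery_proportion(cpg_flags)

# Region level: a region is a false discovery if all its CpGs are false positives
region_flags = [all(flags) for flags in cpg_calls.values()]
fdp_region = false_discovery_proportion(region_flags)

print(f"CpG-level FDP = {fdp_cpg:.2f}, region-level FDP = {fdp_region:.2f}")
```

Controlling error at the CpG level therefore gives no guarantee at the region level once CpGs are grouped.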

Accurate inference of DMRs

  • Korthauer et al. 2018 (https://doi.org/10.1093/biostatistics/kxy007)
  • Key ideas:
    • model methylation signal over region
    • permutation to achieve accurate FDR control
  • dmrseq: accurate inference for detection of differentially methylated regions (Bioconductor)
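
The permutation idea can be illustrated on a single candidate region. This sketch is a simplified stand-in and not the dmrseq algorithm: the region statistic (difference in mean methylation between conditions), the per-sample values, and the function names are all assumptions made for the example.

```python
from itertools import combinations

def region_stat(values, group1_idx):
    """Difference in mean region-level methylation between the two groups."""
    g1 = [v for i, v in enumerate(values) if i in group1_idx]
    g2 = [v for i, v in enumerate(values) if i not in group1_idx]
    return sum(g1) / len(g1) - sum(g2) / len(g2)

def permutation_pvalue(values, group1_idx):
    """Two-sided p-value from enumerating every relabeling of the samples."""
    observed = abs(region_stat(values, group1_idx))
    n, k = len(values), len(group1_idx)
    null_stats = [abs(region_stat(values, set(idx)))
                  for idx in combinations(range(n), k)]
    # Small tolerance guards against floating-point ties
    n_extreme = sum(s >= observed - 1e-12 for s in null_stats)
    return n_extreme / len(null_stats)

# Hypothetical per-sample mean methylation over one region (3 vs 3 samples)
region_means = [0.82, 0.79, 0.85, 0.41, 0.38, 0.45]
pval = permutation_pvalue(region_means, {0, 1, 2})
print(f"permutation p-value = {pval:.2f}")
```

With 3 vs 3 samples there are only 20 relabelings, so a single region cannot reach a p-value below 0.1 even with perfect separation. That limitation is one reason to pool null statistics across many regions rather than test each region in isolation.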

dmrseq: 2-stage approach

dmrseq output

Review: role in gene regulation

Correlation or causation?

First genome-wide study of causality

  • “Promoter DNA methylation is generally not sufficient for transcriptional inactivation”

Design of Ford et al. Study

Conclusion: methylation not generally sufficient

Reanalysis with dmrseq

  • original study used DSS (an approach that groups individually significant CpGs)
  • main question: does accurate inference of the methylation increase change the conclusion?

Results of the reanalysis

Significance: statistical \(\iff\) biological